{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exact GP Regression with Multiple GPUs and Kernel Partitioning\n",
    "## Introduction\n",
    "In this notebook, we'll demonstrate training exact GPs on large datasets using two key features from the paper https://arxiv.org/abs/1903.08114: \n",
    "\n",
    "1. The ability to distribute the kernel matrix across multiple GPUs, for additional parallelism.\n",
    "2. Partitioning the kernel into chunks computed on-the-fly when performing each MVM to reduce memory usage.\n",
    "\n",
    "We'll be using the `protein` dataset, which has about 37000 training examples. The techniques in this notebook can be applied to much larger datasets, but the training time required will depend on the computational resources you have available: both the number of GPUs available and the amount of memory they have (which determines the partition size) have a significant effect on training time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import math\n",
    "import torch\n",
    "import gpytorch\n",
    "import sys\n",
    "from matplotlib import pyplot as plt\n",
    "sys.path.append('../')\n",
    "from LBFGS import FullBatchLBFGS\n",
    "\n",
    "%matplotlib inline\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will be using the Protein UCI dataset which contains a total of 40000+ data points. The next cell will download this dataset from a Google drive and load it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import urllib.request\n",
    "from scipy.io import loadmat\n",
    "dataset = 'protein'\n",
    "if not os.path.isfile(f'../{dataset}.mat'):\n",
    "    print(f'Downloading \\'{dataset}\\' UCI dataset...')\n",
    "    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1nRb8e7qooozXkNghC5eQS0JeywSXGX2S',\n",
    "                               f'../{dataset}.mat')\n",
    "    \n",
    "data = torch.Tensor(loadmat(f'../{dataset}.mat')['data'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normalization and train/test Splits\n",
    "\n",
    "In the next cell, we split the data 80/20 as train and test, and do some basic z-score feature normalization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "N = data.shape[0]\n",
    "# make train/val/test\n",
    "n_train = int(0.8 * N)\n",
    "train_x, train_y = data[:n_train, :-1], data[:n_train, -1]\n",
    "test_x, test_y = data[n_train:, :-1], data[n_train:, -1]\n",
    "\n",
    "# normalize features\n",
    "mean = train_x.mean(dim=-2, keepdim=True)\n",
    "std = train_x.std(dim=-2, keepdim=True) + 1e-6 # prevent dividing by 0\n",
    "train_x = (train_x - mean) / std\n",
    "test_x = (test_x - mean) / std\n",
    "\n",
    "# normalize labels\n",
    "mean, std = train_y.mean(),train_y.std()\n",
    "train_y = (train_y - mean) / std\n",
    "test_y = (test_y - mean) / std\n",
    "\n",
    "# make continguous\n",
    "train_x, train_y = train_x.contiguous(), train_y.contiguous()\n",
    "test_x, test_y = test_x.contiguous(), test_y.contiguous()\n",
    "\n",
    "output_device = torch.device('cuda:0')\n",
    "\n",
    "train_x, train_y = train_x.to(output_device), train_y.to(output_device)\n",
    "test_x, test_y = test_x.to(output_device), test_y.to(output_device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How many GPUs do you want to use?\n",
    "\n",
    "In the next cell, specify the `n_devices` variable to be the number of GPUs you'd like to use. By default, we will use all devices available to us."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Planning to run on 2 GPUs.\n"
     ]
    }
   ],
   "source": [
    "n_devices = torch.cuda.device_count()\n",
    "print('Planning to run on {} GPUs.'.format(n_devices))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next cell we define our GP model and training code. For this notebook, the only thing different from the Simple GP tutorials is the use of the `MultiDeviceKernel` to wrap the base covariance module. This allows for the use of multiple GPUs behind the scenes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ExactGPModel(gpytorch.models.ExactGP):\n",
    "    def __init__(self, train_x, train_y, likelihood, n_devices):\n",
    "        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)\n",
    "        self.mean_module = gpytorch.means.ConstantMean()\n",
    "        base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())\n",
    "        \n",
    "        self.covar_module = gpytorch.kernels.MultiDeviceKernel(\n",
    "            base_covar_module, device_ids=range(n_devices),\n",
    "            output_device=output_device\n",
    "        )\n",
    "    \n",
    "    def forward(self, x):\n",
    "        mean_x = self.mean_module(x)\n",
    "        covar_x = self.covar_module(x)\n",
    "        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)\n",
    "\n",
    "def train(train_x,\n",
    "          train_y,\n",
    "          n_devices,\n",
    "          output_device,\n",
    "          checkpoint_size,\n",
    "          preconditioner_size,\n",
    "          n_training_iter,\n",
    "):\n",
    "    likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)\n",
    "    model = ExactGPModel(train_x, train_y, likelihood, n_devices).to(output_device)\n",
    "    model.train()\n",
    "    likelihood.train()\n",
    "    \n",
    "    optimizer = FullBatchLBFGS(model.parameters(), lr=0.1)\n",
    "    # \"Loss\" for GPs - the marginal log likelihood\n",
    "    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)\n",
    "\n",
    "    \n",
    "    with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \\\n",
    "         gpytorch.settings.max_preconditioner_size(preconditioner_size):\n",
    "\n",
    "        def closure():\n",
    "            optimizer.zero_grad()\n",
    "            output = model(train_x)\n",
    "            loss = -mll(output, train_y)\n",
    "            return loss\n",
    "\n",
    "        loss = closure()\n",
    "        loss.backward()\n",
    "\n",
    "        for i in range(n_training_iter):\n",
    "            options = {'closure': closure, 'current_loss': loss, 'max_ls': 10}\n",
    "            loss, _, _, _, _, _, _, fail = optimizer.step(options)\n",
    "            \n",
    "            print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (\n",
    "                i + 1, n_training_iter, loss.item(),\n",
    "                model.covar_module.module.base_kernel.lengthscale.item(),\n",
    "                model.likelihood.noise.item()\n",
    "            ))\n",
    "            \n",
    "            if fail:\n",
    "                print('Convergence reached!')\n",
    "                break\n",
    "    \n",
    "    print(f\"Finished training on {train_x.size(0)} data points using {n_devices} GPUs.\")\n",
    "    return model, likelihood"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Automatically determining GPU Settings\n",
    "\n",
    "In the next cell, we automatically determine a roughly reasonable partition or *checkpoint* size that will allow us to train without using more memory than the GPUs available have. Not that this is a coarse estimate of the largest possible checkpoint size, and may be off by as much as a factor of 2. A smarter search here could make up to a 2x performance improvement."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of devices: 2 -- Kernel partition size: 0\n",
      "RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 1; 10.73 GiB total capacity; 7.48 GiB already allocated; 2.46 GiB free; 21.49 MiB cached)\n",
      "Number of devices: 2 -- Kernel partition size: 18292\n",
      "RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 10.73 GiB total capacity; 6.37 GiB already allocated; 448.94 MiB free; 1.30 GiB cached)\n",
      "Number of devices: 2 -- Kernel partition size: 9146\n",
      "Iter 1/1 - Loss: 0.893   lengthscale: 0.486   noise: 0.248\n",
      "Finished training on 36584 data points using 2 GPUs.\n"
     ]
    }
   ],
   "source": [
    "import gc\n",
    "\n",
    "def find_best_gpu_setting(train_x,\n",
    "                          train_y,\n",
    "                          n_devices,\n",
    "                          output_device,\n",
    "                          preconditioner_size\n",
    "):\n",
    "    N = train_x.size(0)\n",
    "    \n",
    "    # Find the optimum partition/checkpoint size by decreasing in powers of 2\n",
    "    # Start with no partitioning (size = 0)\n",
    "    settings = [0] + [int(n) for n in np.ceil(N / 2**np.arange(1, np.floor(np.log2(N))))]\n",
    "\n",
    "    for checkpoint_size in settings:\n",
    "        print('Number of devices: {} -- Kernel partition size: {}'.format(n_devices, checkpoint_size))\n",
    "        try:\n",
    "            # Try a full forward and backward pass with this setting to check memory usage\n",
    "            _, _ = train(train_x, train_y,\n",
    "                         n_devices=n_devices, output_device=output_device,\n",
    "                         checkpoint_size=checkpoint_size,\n",
    "                         preconditioner_size=preconditioner_size, n_training_iter=1)\n",
    "            \n",
    "            # when successful, break out of for-loop and jump to finally block\n",
    "            break\n",
    "        except RuntimeError as e:\n",
    "            print('RuntimeError: {}'.format(e))\n",
    "        except AttributeError as e:\n",
    "            print('AttributeError: {}'.format(e))\n",
    "        finally:\n",
    "            # handle CUDA OOM error\n",
    "            gc.collect()\n",
    "            torch.cuda.empty_cache()\n",
    "    return checkpoint_size\n",
    "\n",
    "# Set a large enough preconditioner size to reduce the number of CG iterations run\n",
    "preconditioner_size = 100\n",
    "checkpoint_size = find_best_gpu_setting(train_x, train_y,\n",
    "                                        n_devices=n_devices, \n",
    "                                        output_device=output_device,\n",
    "                                        preconditioner_size=preconditioner_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iter 1/20 - Loss: 0.897   lengthscale: 0.486   noise: 0.253\n",
      "Iter 2/20 - Loss: 0.892   lengthscale: 0.354   noise: 0.144\n",
      "Iter 3/20 - Loss: 0.859   lengthscale: 0.305   noise: 0.125\n",
      "Iter 4/20 - Loss: 0.859   lengthscale: 0.292   noise: 0.116\n",
      "Iter 5/20 - Loss: 0.862   lengthscale: 0.292   noise: 0.116\n",
      "Convergence reached!\n",
      "Finished training on 36584 data points using 2 GPUs.\n"
     ]
    }
   ],
   "source": [
    "model, likelihood = train(train_x, train_y,\n",
    "                          n_devices=n_devices, output_device=output_device,\n",
    "                          checkpoint_size=10000,\n",
    "                          preconditioner_size=100,\n",
    "                          n_training_iter=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing test time caches"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get into evaluation (predictive posterior) mode\n",
    "model.eval()\n",
    "likelihood.eval()\n",
    "\n",
    "with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(1000):\n",
    "    # Make predictions on a small number of test points to get the test time caches computed\n",
    "    latent_pred = model(test_x[:2, :])\n",
    "    del latent_pred  # We don't care about these predictions, we really just want the caches."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Testing: Computing predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.1 s, sys: 728 ms, total: 1.83 s\n",
      "Wall time: 1.88 s\n",
      "Test RMSE: 0.551821768283844\n"
     ]
    }
   ],
   "source": [
    "with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(1000):\n",
    "    %time latent_pred = model(test_x)\n",
    "    \n",
    "test_rmse = torch.sqrt(torch.mean(torch.pow(latent_pred.mean - test_y, 2)))\n",
    "print(f\"Test RMSE: {test_rmse.item()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
